An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
Abstract
Different types of OCR errors often occur in texts due to low-quality scanned document images or limitations of the OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates are generated, and their neighborhoods are explored using correction character edits controlled by an adapted hill-climbing algorithm. The correction characters are extracted only from the original ground-truth texts, so they do not depend on training data. A weighted objective function, used to score and rank the candidates, is heuristically tested to find optimal weight combinations. The proposed model is evaluated on a text dataset originating from the Vietnamese handwritten database of the ICFHR 2018 online recognition competition. It is also verified with respect to its stability and complexity. The experimental results show that our model achieves competitive performance compared with other models.
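The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain steepest-ascent hill climbing rather than their adapted variant, and the two scoring features (relative word frequency in the ground-truth text and string similarity to the OCR token), the weights, and every function name are assumptions made only for illustration.

```python
import difflib
from typing import Callable, Dict, Iterable, List


def correction_alphabet(lexicon: Dict[str, int]) -> List[str]:
    """Collect correction characters from ground-truth words only."""
    return sorted({ch for word in lexicon for ch in word})


def neighbors(candidate: str, alphabet: Iterable[str]) -> List[str]:
    """Return every string one character edit away from `candidate`."""
    letters = list(alphabet)
    out = set()
    for i in range(len(candidate) + 1):
        for ch in letters:
            out.add(candidate[:i] + ch + candidate[i:])          # insertion
        if i < len(candidate):
            out.add(candidate[:i] + candidate[i + 1:])           # deletion
            for ch in letters:
                out.add(candidate[:i] + ch + candidate[i + 1:])  # substitution
    out.discard(candidate)
    return sorted(out)


def make_score(lexicon: Dict[str, int], ocr_token: str,
               w_freq: float = 0.6, w_sim: float = 0.4) -> Callable[[str], float]:
    """Build a weighted objective; features and weights are hypothetical."""
    total = sum(lexicon.values())

    def score(candidate: str) -> float:
        freq = lexicon.get(candidate, 0) / total
        sim = difflib.SequenceMatcher(None, candidate, ocr_token).ratio()
        return w_freq * freq + w_sim * sim

    return score


def hill_climb(ocr_token: str, alphabet: Iterable[str],
               score: Callable[[str], float], max_steps: int = 10) -> str:
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    letters = list(alphabet)
    current, current_score = ocr_token, score(ocr_token)
    for _ in range(max_steps):
        best = max(neighbors(current, letters), key=score, default=current)
        best_score = score(best)
        if best_score <= current_score:
            break  # local optimum reached
        current, current_score = best, best_score
    return current


if __name__ == "__main__":
    # Toy ground-truth word counts, invented for this example.
    toy_lexicon = {"text": 5, "nam": 3, "scan": 2}
    toy_alphabet = correction_alphabet(toy_lexicon)
    print(hill_climb("tcxt", toy_alphabet, make_score(toy_lexicon, "tcxt")))
```

In this toy setup the weights are fixed by hand. The abstract instead describes testing weight combinations heuristically, which would correspond to rerunning the search over a grid of `w_freq` and `w_sim` values and keeping the combination that corrects the most tokens.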
Similar resources
An Efficient OCR Error Correction Method for Japanese Text Recognition
OCR error correction using Japanese morphological analysis contains two time-consuming procedures: extraction of candidate words from combinations of candidate characters, and finding the most plausible word sequence in combinations of the candidate words. In this paper an optimal word extraction technique, and the use of lexical entries that are tailored for Japanese verb inflection, are inves...
Statistical Learning for OCR Text Correction
The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications used in text analyzing pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggest correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, ...
An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-scanned Texts
OCR (Optical Character Recognition) scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. This paper presents an investigation for the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis on characteristics of spelling errors given by an OCR scanner. Then, we pro...
A Statistical Approach to Automatic OCR Error Correction in Context
This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexico...
OCR Error Correction Using Statistical Machine Translation
In this paper, we explore the use of a statistical machine translation system for optical character recognition (OCR) error correction. We investigate the use of word and character-level models to support a translation from OCR system output to correct French text. Our experiments show that character and word based machine translation correction make significant improvements to the quality of t...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3283340